Remove bed_reader #400
The PLINK tests run somewhat faster with this change! About 10% or so.
Haha, well, there you go! We'll need to modularise and test more, but I think that'll be quicker than dealing with all the details about packaging. One quick question: is there a reason you didn't use pandas for the text files? I think this would be worthwhile, and it would mean that we can port over the sgkit conversion functions like read_fam. I'm happy with pandas as a dependency; it's pretty much universal now.
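For reference, a minimal sketch of what reading a .fam file with pandas could look like. The function name `read_fam_sketch` and the column names are hypothetical and chosen for illustration; this is not the sgkit `read_fam` implementation. It assumes the standard six whitespace-separated PLINK 1 .fam columns and keeps every value as a string.

```python
import io

import pandas as pd

# Hypothetical column names for the six standard PLINK 1 .fam fields.
FAM_COLUMNS = [
    "family_id",
    "member_id",
    "paternal_id",
    "maternal_id",
    "sex",
    "phenotype",
]


def read_fam_sketch(path_or_buffer):
    """Read a whitespace-delimited .fam file into a DataFrame of strings."""
    return pd.read_csv(
        path_or_buffer,
        sep=r"\s+",  # fields may be separated by spaces or tabs
        header=None,  # .fam files have no header row
        names=FAM_COLUMNS,
        dtype=str,  # keep IDs such as "0" or "007" exactly as written
        keep_default_na=False,  # do not turn strings like "NA" into NaN
    )


# Usage with an in-memory file of two samples.
example = io.StringIO("fam1 ind1 0 0 1 -9\nfam1 ind2 0 0 2 -9\n")
fam = read_fam_sketch(example)
```

Keeping everything as strings avoids guessing at types; interpreting the sex and phenotype codes would be a separate step.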
@tomwhite - do you see any issues with dropping bed_reader and using the lookup table approach here for decoding bed? I see a lot of advantages...
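To illustrate the general idea of lookup-table decoding, here is a pure-Python sketch. It is not the code in this PR: the names are hypothetical, and the output convention (counts of the first allele, with -1 for missing) is an assumption made for the example. A real implementation would more likely keep the table in a NumPy array and index it with a whole array of bytes at once.

```python
# PLINK 1 BED packs four samples per byte, two bits each, low-order bits first.
# Two-bit codes: 00 = homozygous first allele, 01 = missing,
#                10 = heterozygous, 11 = homozygous second allele.
# Assumption of this sketch: decode to counts of the first allele, -1 = missing.
CODE_TO_FIRST_ALLELE_COUNT = {0b00: 2, 0b01: -1, 0b10: 1, 0b11: 0}


def build_lookup():
    """Precompute the four decoded values for each of the 256 possible bytes."""
    table = []
    for byte in range(256):
        table.append(
            tuple(
                CODE_TO_FIRST_ALLELE_COUNT[(byte >> shift) & 0b11]
                for shift in (0, 2, 4, 6)
            )
        )
    return table


LOOKUP = build_lookup()


def decode_variant(packed, num_samples):
    """Decode one variant's packed bytes into per-sample values."""
    decoded = []
    for byte in packed:
        decoded.extend(LOOKUP[byte])  # one table lookup yields four samples
    # The final byte may contain padding when num_samples is not a multiple of 4.
    return decoded[:num_samples]


# 0xE4 == 0b11100100 holds the codes 00, 01, 10, 11 from low bits to high bits.
print(decode_variant(bytes([0xE4]), 4))  # [2, -1, 1, 0]
```

The table is built once, so decoding does no per-sample bit manipulation at read time.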
I'm pleased to see this. It's not like bed_reader supports PLINK 2, so doing this wouldn't cut off a future migration path. We probably want more tests for corner cases that bed_reader supports. Can you do the same thing for BGEN? 😄
Didn't want to add a dependency while removing one! If you're happy with pandas then happy to switch to it.
No - it's too complicated for this kind of treatment unfortunately
I'm happy to commit an initial version of this that just removes the bed_reader dep @benjeffery, and we can log issues for porting in the sgkit auxiliary file reading code and adding more tests for the data reading.
While writing this, I noticed that we're not storing the plink "family ID" (or parent ids) anywhere. Should those be included in the zarr?
Leaving them out for now, there's an issue open to track
I'll follow up on this here @benjeffery, I'm going to tack on some commits to move in the sgkit parsers and tests and will modularise the reader.
I've added some tests that generate a bunch of BED files using bed_reader, and I think it's looking solid. I haven't looked at performance at all - I'll need to source a big PLINK dataset tomorrow and try it out.
Looks great!
Very nice, thanks for tidying up my rushed prototype!
if magic != b"\x6c\x1b\x01":
    raise ValueError("Invalid BED file magic bytes")

# We could check the size of the bed file here, but that would
Good point about streams - I guess there is no way to know if the user has inconsistent bim/fam/bed files, some combinations of which would just give silently corrupted data.
Yeah, that's just how it is in the real world. There's no point in ruling out useful functionality just to do checking that other people don't bother with anyway.
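For context, the size check being discussed could look roughly like the sketch below. It is not part of this PR, and the function names are hypothetical. It assumes a variant-major PLINK 1 BED file stored as a regular file on disk, which is exactly why it would not work for streams: the size has to be known up front.

```python
import os

BED_MAGIC = b"\x6c\x1b\x01"


def expected_bed_size(num_variants, num_samples):
    """Size in bytes of a variant-major BED file with the given dimensions."""
    # Each variant occupies ceil(num_samples / 4) bytes: four samples per byte.
    bytes_per_variant = (num_samples + 3) // 4
    return len(BED_MAGIC) + num_variants * bytes_per_variant


def check_bed_file(path, num_variants, num_samples):
    """Raise ValueError if the BED file disagrees with the bim/fam row counts."""
    with open(path, "rb") as f:
        magic = f.read(len(BED_MAGIC))
    if magic != BED_MAGIC:
        raise ValueError("Invalid BED file magic bytes")
    actual = os.path.getsize(path)  # requires a regular file, not a stream
    expected = expected_bed_size(num_variants, num_samples)
    if actual != expected:
        raise ValueError(
            f"BED file is {actual} bytes, expected {expected} for "
            f"{num_variants} variants and {num_samples} samples"
        )
```

Here `num_variants` and `num_samples` would come from the row counts of the .bim and .fam files. The check only catches mismatched dimensions; files with the same shape but a different sample or variant order would still pass.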
Fixes #397